[SPARK-54462][SQL] Add isDataFrameWriterV1 option for Delta datasource compatibility
#53173
base: master
Conversation
Thank you for reporting the issue, @juliuszsompolski. I have a few questions.
- Which Apache Spark preview version and Delta version did you test this with? Specifically, with which preview did this start to happen?
- Why don't we implement this in io.delta.sql.DeltaSparkSessionExtension, or document it, instead of changing the Apache Spark source code?
- Do you think we can have test coverage with a dummy data source?
- If this is an emergency fix, what would be the non-emergency fix?
This is an emergency fix to prevent a breaking change resulting in data corruption with Delta data sources in Spark 4.1.
```scala
    serde = None,
    external = false,
    constraints = Seq.empty)
  val writeOptions = if (source == "delta") {
```
Does the Apache Spark source code have this kind of Delta-specific logic anywhere, @juliuszsompolski?
This looks like the first proposal to put a third-party company's data source into the Apache Spark source code. At first glance, this string match looks a little fragile to me.
```scala
  // To retain compatibility of the Delta datasource with Spark 4.1 in connect mode, Spark
  // provides this explicit storage option to indicate to Delta datasource that this call
  // is coming from DataFrameWriter V1.
  //
```
Per:
```scala
  // FIXME: Since the details of the documented semantics of Spark's DataFrameWriter V1
  // saveAsTable API differs from that of CREATE/REPLACE TABLE AS SELECT, Spark should
  // not be reusing the exact same logical plan for these APIs.
  // Existing Datasources which have been implemented following Spark's documentation of
  // these APIs should have a way to differentiate between these APIs.
```
Why don't we just always append the option? The downstream data sources that care about this behaviour will make the change accordingly.
What changes were proposed in this pull request?
Make DataFrameWriter saveAsTable add a write option `isDataFrameWriterV1 = true` when using Overwrite mode with a `delta` data source. This is an emergency fix to prevent a breaking change resulting in data corruption with Delta data sources in Spark 4.1.
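A minimal sketch of the proposed logic (the helper name `buildWriteOptions` is illustrative; the actual patch edits `DataFrameWriter.saveAsTable` directly):

```scala
import org.apache.spark.sql.SaveMode

// Sketch, simplified from the patch: when the V1 writer overwrites a table
// through the delta source, an extra write option tags the call so the
// datasource can tell it apart from DataFrameWriterV2's createOrReplace.
def buildWriteOptions(
    source: String,
    mode: SaveMode,
    extraOptions: Map[String, String]): Map[String, String] = {
  if (source == "delta" && mode == SaveMode.Overwrite) {
    extraOptions + ("isDataFrameWriterV1" -> "true")
  } else {
    extraOptions
  }
}
```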
Why are the changes needed?
Spark's SaveMode.Overwrite is documented as:
> Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame.
It does not define the behaviour of overwriting the table metadata (schema, etc.). The Delta datasource's interpretation of this DataFrameWriter V1 API documentation is to not replace the table schema unless the Delta-specific option `overwriteSchema` is set to `true`.
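For example, with the V1 writer a user must opt in to schema replacement explicitly (`df` and the table name `events` are illustrative):

```scala
// With the V1 writer, Overwrite replaces the data; the Delta-specific
// overwriteSchema option must be set to also replace the table schema.
df.write
  .format("delta")
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .saveAsTable("events")
```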
However, DataFrameWriter V1 creates a ReplaceTableAsSelect plan, which is the same plan created by the DataFrameWriterV2 createOrReplace API, which is documented as:
> Create a new table or replace an existing table with the contents of the data frame. The output table's schema, partition layout, properties, and other configuration will be based on the contents of the data frame and the configuration set on this writer. If the table exists, its configuration and data will be replaced.
Therefore, for calls via DataFrameWriter V2 createOrReplace, the metadata always needs to be replaced, and the Delta datasource does not use the `overwriteSchema` option.
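The equivalent V2 call replaces the schema unconditionally, with no option involved:

```scala
// DataFrameWriterV2: createOrReplace is documented to replace the table's
// schema and configuration based on the contents of the data frame.
df.writeTo("events")
  .using("delta")
  .createOrReplace()
```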
Since the created plan is exactly the same, Delta had been using a very ugly hack to detect which API the call is coming from, based on the stack trace of the call.
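For illustration, such a stack-trace check might look like this (a hypothetical sketch, not Delta's actual code):

```scala
// Hypothetical sketch: detect the V1 caller by walking the current stack.
// This only works while the plan is created and executed in the same JVM
// call chain, which is exactly what connect mode breaks.
def calledFromDataFrameWriterV1: Boolean =
  Thread.currentThread().getStackTrace.exists { frame =>
    frame.getClassName.endsWith("DataFrameWriter") &&
      frame.getMethodName == "saveAsTable"
  }
```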
In Spark 4.1 in connect mode, this stopped working: planning and execution of commands are decoupled, so the stack trace no longer contains the point where the plan was created.
To retain compatibility of the Delta datasource with Spark 4.1 in connect mode, Spark provides this explicit storage option to indicate to the Delta datasource that the call is coming from DataFrameWriter V1.
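On the datasource side, the option could then be consumed roughly like this (a sketch with illustrative names, not Delta's actual implementation):

```scala
// Hypothetical sketch: decide whether a replace should also replace the schema.
// V2 createOrReplace always replaces it; V1 saveAsTable only does so when the
// Delta-specific overwriteSchema option is set.
def shouldReplaceSchema(writeOptions: Map[String, String]): Boolean = {
  val isV1Writer = writeOptions.getOrElse("isDataFrameWriterV1", "false").toBoolean
  !isV1Writer || writeOptions.getOrElse("overwriteSchema", "false").toBoolean
}
```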
Followup: since the documented semantics of Spark's DataFrameWriter V1 saveAsTable API differ from those of CREATE/REPLACE TABLE AS SELECT, Spark should not be reusing the exact same logical plan for these APIs.
Existing datasources that have been implemented following Spark's documentation of these APIs should have a way to differentiate between them.
However, at this point this is an emergency fix: releasing Spark 4.1 as-is would cause data corruption with Delta in DataFrameWriter saveAsTable in overwrite mode, because Delta's `overwriteSchema` option would not be interpreted correctly.
Does this PR introduce any user-facing change?
No
How was this patch tested?
It has been tested with tests that are not part of this PR. Proper testing in connect mode requires changes on both the Spark and Delta sides; integrating them will be done as follow-up work.
Was this patch authored or co-authored using generative AI tooling?
Assisted by Claude Code.
Generated-by: claude code, model sonnet 4.5